🐛 Fix flakes in e2e tests for metrics #5138

mayuka-c · 2025-10-17T05:09:58Z

Looking at the logs, it appears that while the curl metrics pod is being created, calls to the webhook fail with "connection refused". I added a precheck to ensure the webhook is ready before the pod is created.

Validation

Ran the e2e tests multiple times here and did not see the flakes: https://github.com/mayuka-c/kubebuilder/actions/runs/18582517449

k8s-ci-robot · 2025-10-17T05:10:09Z

Hi @mayuka-c. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-10-17T05:10:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mayuka-c
Once this PR has been reviewed and has the lgtm label, please assign camilamacedo86 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

camilamacedo86

Hi @mayuka-c 👋

That looks great! However, this change only makes sense for projects that actually have webhooks.
Not all projects do, so we use markers to inject code (see Kubebuilder markers).
For example:

+kubebuilder:scaffold:e2e-webhooks-checks

If we really need this, we have to check whether we can reuse the existing marker; it seems we cannot, so we would need a new one. We need to be 1000% sure that this is the best way. Ideally we should avoid creating new markers, since we have to keep supporting them and they make end-user customisations harder.

So, if we need to check whether the webhook is serving before the metrics, we should only add this logic when webhooks are present — meaning we’ll need to use a marker for it.

Also, please make sure the scaffolded code remains as generic as possible, so it still works even if users customise their configurations.

Lastly, since this change affects the scaffolded e2e tests in projects (which impact end users), we need to be careful with the commit type.
We should use the 🐛 emoji instead of 🌱 — the seed emoji is only used for changes that don’t affect users and can be ignored in the release notes.

mayuka-c · 2025-10-21T13:46:18Z

Hi @camilamacedo86
Regarding the below:

That looks great! However, this change only makes sense for projects that actually have webhooks. Not all projects do, so we use markers to inject code (see Kubebuilder markers). For example:
+kubebuilder:scaffold:e2e-webhooks-checks
If we really need this, we have to check whether we can reuse the existing marker; it seems we cannot, so we would need a new one. We need to be 1000% sure that this is the best way. Ideally we should avoid creating new markers, since we have to keep supporting them and they make end-user customisations harder.

So, if we need to check whether the webhook is serving before the metrics, we should only add this logic when webhooks are present — meaning we’ll need to use a marker for it.

Also, please make sure the scaffolded code remains as generic as possible, so it still works even if users customise their configurations.

If I'm not mistaken, I could add a new fragment for the webhook readiness check here and inject it at the existing marker here.

The only problem I see is that it would also run the other fragment checks, which is not required.

Shall I go ahead and define a new marker? If you have a better solution, please do suggest it :)

mayuka-c · 2025-10-21T13:47:31Z

Lastly, since this change affects the scaffolded e2e tests in projects (which impact end users), we need to be careful with the commit type. We should use the 🐛 emoji instead of 🌱 — the seed emoji is only used for changes that don’t affect users and can be ignored in the release notes.

Sorry, my bad. I will update the title with the proper emoji. Thank you :)

camilamacedo86 · 2025-10-21T18:51:29Z

Hi @mayuka-c

Let's create a new marker.
Also, if you can review the code and ensure that it is generic and good enough for any project, that would be great 🎉

mayuka-c · 2025-10-22T10:43:23Z

Hi @mayuka-c

Let's create a new marker. Also, if you can review the code and ensure that it is generic and good enough for any project, that would be great 🎉

Sure, thank you @camilamacedo86. I will review the code and make sure it is generic.

mayuka-c · 2025-10-30T12:18:21Z

Hi @camilamacedo86 👋
Sorry for getting back late; I was busy with other things.

I have added a new marker, e2e-webhooks-readiness, and it seems to be working fine.
The readiness check test cases are scaffolded in projects where webhooks are present in the PROJECT file; for the basic testdata project, the webhook tests are not added.

Please let me know if this is fine or if I'm missing anything.

docs/book/src/reference/markers/scaffold.md

camilamacedo86 · 2025-10-30T15:01:26Z

testdata/project-v4-multigroup/test/e2e/e2e_test.go

+				g.Expect(err).NotTo(HaveOccurred(), "Webhook server not responding on port 443")
+			}
+			Eventually(verifyWebhookServiceReady, 2*time.Minute).Should(Succeed())
+			// +kubebuilder:scaffold:e2e-webhooks-readiness


I think we can simplify this with:

By("waiting for webhook service to be ready if webhooks are configured")
verifyWebhookServiceReady := func(g Gomega) {
	const webhookServiceName = "{{ProjectName}}-webhook-service"

	cmd := exec.Command("kubectl", "get", "service", webhookServiceName, "-n", namespace)
	_, err := utils.Run(cmd)
	if err != nil {
		// Project has no webhook service; nothing to wait on.
		return
	}

	cmd = exec.Command("kubectl", "get", "pods", "-l", "control-plane=controller-manager",
		"-n", namespace, "-o", "jsonpath={.items[0].status.conditions[?(@.type=='Ready')].status}")
	output, err := utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred())
	g.Expect(output).To(Equal("True"),
		"Controller manager pod not ready (webhook server may not be accepting connections)")

	cmd = exec.Command("kubectl", "get", "endpoints", webhookServiceName,
		"-n", namespace, "-o", "jsonpath={.subsets[*].addresses[*].ip}")
	output, err = utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred())
	g.Expect(output).NotTo(BeEmpty(), "Webhook service endpoints are not ready")

	cmd = exec.Command("kubectl", "get", "--raw",
		fmt.Sprintf("/api/v1/namespaces/%s/services/https:%[2]s:https/proxy/readyz", namespace, webhookServiceName))
	_, err = utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred(), "Webhook server not responding on port 443")
}
Eventually(verifyWebhookServiceReady, 2*time.Minute).Should(Succeed())

Thank you, this looks perfect :)

I have modified this slightly. Since we now have a marker that runs exclusively for webhooks, the check fails if the service is not found instead of ignoring the error. I changed:

const webhookServiceName = "{{ProjectName}}-webhook-service"

cmd := exec.Command("kubectl", "get", "service", webhookServiceName, "-n", namespace)
_, err := utils.Run(cmd)
if err != nil {
	// Project has no webhook service; nothing to wait on.
	return
}

to

const webhookServiceName = "{{.ProjectName}}-webhook-service"

// Webhook service should exist since webhooks are configured
cmd := exec.Command("kubectl", "get", "service", webhookServiceName, "-n", namespace)
_, err := utils.Run(cmd)
g.Expect(err).NotTo(HaveOccurred(), "Webhook service should exist but was not found")

Regarding this:

cmd = exec.Command("kubectl", "get", "--raw",
	fmt.Sprintf("/api/v1/namespaces/%s/services/https:%[2]s:https/proxy/readyz", namespace, webhookServiceName))
_, err = utils.Run(cmd)
g.Expect(err).NotTo(HaveOccurred(), "Webhook server not responding on port 443")

It looks like the check fails when the service proxy is used to test readiness; see https://github.com/mayuka-c/kubebuilder/actions/runs/18962257859/job/54151953303.

I will look into what the reason could be. Please let me know if the endpoint used is wrong. Thanks!

Using a pod with curl to do this is fine, but we need to clean up afterwards.
That is why I was thinking of an alternative option.

Unfortunately I had to revert to using curl. The --rm flag removes the pod once it finishes. Please let me know if that is fine.

We need to ensure that everything we create is cleaned up.
Using --rm might be a good option if the pod is only removed after the check.
It seems like a good approach; I will check it out.
Thanks a lot for looking into that 🚀

Well done for sure.

camilamacedo86

Hey @mayuka-c

That is great, but I think we could improve it as follows:

By("waiting for webhook service to be ready")
verifyWebhookServiceReady := func(g Gomega) {
	const webhookServiceName = "project-v4-multigroup-webhook-service"

	// 1) Ensure the webhook Service exists
	cmd := exec.Command("kubectl", "get", "service", webhookServiceName, "-n", namespace)
	_, err := utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred(), "Webhook service should exist but was not found")

	// 2) Wait until the controller-manager Pod is Ready
	cmd = exec.Command(
		"kubectl", "wait", "pod",
		"-l", "control-plane=controller-manager",
		"-n", namespace,
		"--for=condition=Ready",
		"--timeout=90s",
	)
	_, err = utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred(), "Controller manager pod not ready (webhook server may not be accepting connections)")

	// 3) Verify webhook Service endpoints are ready using EndpointSlice (modern clusters)
	cmd = exec.Command(
		"kubectl", "get", "endpointslices",
		"-l", "kubernetes.io/service-name="+webhookServiceName,
		"-n", namespace,
		"-o", "jsonpath={.items[*].endpoints[?(@.conditions.ready==true)].addresses[*]}",
	)
	output, err := utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred(), "Failed to query EndpointSlices for the webhook service")
	g.Expect(strings.TrimSpace(output)).NotTo(BeEmpty(),
		"Webhook service has no ready endpoints (EndpointSlice has zero ready addresses)")

	// 4) Test webhook connectivity using an ephemeral curl pod in the same namespace
	cmd = exec.Command(
		"kubectl", "run", "webhook-test",
		"--rm", "-i", "--restart=Never",
		"-n", namespace, // run the curl pod in the test namespace
		"--image=curlimages/curl:latest", "--",
		"curl", "-ksSf", "--connect-timeout", "5",
		"--retry", "3", "--retry-delay", "1",
		"https://"+webhookServiceName+"."+namespace+".svc:443/readyz",
	)
	_, err = utils.Run(cmd)
	g.Expect(err).NotTo(HaveOccurred(),
		"Webhook server not responding or /readyz endpoint unhealthy on port 443")
}

Eventually(verifyWebhookServiceReady, 2*time.Minute).Should(Succeed())
// +kubebuilder:scaffold:e2e-webhooks-readiness

I also have a question:

Do we need to keep // +kubebuilder:scaffold:e2e-webhooks-readiness? Should it not be replaced by the content?

camilamacedo86 · 2025-11-10T05:28:50Z

testdata/project-v4-multigroup/test/e2e/e2e_test.go

+				g.Expect(err).NotTo(HaveOccurred(), "Webhook server not responding on port 443")
+			}
+			Eventually(verifyWebhookServiceReady, 2*time.Minute).Should(Succeed())
+			// +kubebuilder:scaffold:e2e-webhooks-readiness


We need to ensure that everything we create is cleaned up.
Using --rm might be a good option if the pod is only removed after the check.
It seems like a good approach; I will check it out.
Thanks a lot for looking into that 🚀

Well done for sure.

camilamacedo86 · 2025-11-10T05:33:09Z

docs/book/src/cronjob-tutorial/testdata/project/test/e2e/e2e_test.go

+					"curl", "-k", "--connect-timeout", "5",
+					"https://"+webhookServiceName+"."+namespace+".svc:443/readyz")
+				_, err = utils.Run(cmd)
+				g.Expect(err).NotTo(HaveOccurred(), "Webhook server not responding on port 443")


-n namespace → ensures the curl pod runs and cleans up in the test namespace.

-ksSf → -f makes curl fail on HTTP error responses (400 and above), -s silences progress output, -S still prints errors, and -k skips TLS certificate verification.

--retry 3 --retry-delay 1 → helps handle transient webhook startup delays.

Improved error message → clearer context if the readiness check fails.

cmd = exec.Command(
	"kubectl", "run", "webhook-test",
	"--rm", "-i", "--restart=Never",
	"-n", namespace, // ensure the pod runs in the same namespace
	"--image=curlimages/curl:latest", "--",
	"curl", "-ksSf", "--connect-timeout", "5",
	"--retry", "3", "--retry-delay", "1",
	"https://"+webhookServiceName+"."+namespace+".svc:443/readyz",
)
_, err = utils.Run(cmd)
g.Expect(err).NotTo(HaveOccurred(),
	"Webhook server not responding or /readyz endpoint unhealthy on port 443")

camilamacedo86 · 2025-11-10T05:36:00Z

docs/book/src/cronjob-tutorial/testdata/project/test/e2e/e2e_test.go

+					"-n", namespace, "-o", "jsonpath={.subsets[*].addresses[*].ip}")
+				output, err = utils.Run(cmd)
+				g.Expect(err).NotTo(HaveOccurred())
+				g.Expect(output).NotTo(BeEmpty(), "Webhook service endpoints are not ready")


I think we should avoid using endpoints.
It seems to be deprecated.

cmd = exec.Command(
	"kubectl", "get", "endpointslices",
	"-l", "kubernetes.io/service-name="+webhookServiceName,
	"-n", namespace,
	"-o", "jsonpath={.items[*].endpoints[?(@.conditions.ready==true)].addresses[*]}",
)
output, err := utils.Run(cmd)
g.Expect(err).NotTo(HaveOccurred(), "Failed to query EndpointSlices for the webhook service")
g.Expect(strings.TrimSpace(output)).NotTo(BeEmpty(),
	"Webhook service has no ready endpoints (EndpointSlice has zero ready addresses)")

camilamacedo86 · 2025-11-10T05:36:42Z

docs/book/src/cronjob-tutorial/testdata/project/test/e2e/e2e_test.go

+				output, err := utils.Run(cmd)
+				g.Expect(err).NotTo(HaveOccurred())
+				g.Expect(output).To(Equal("True"),
+					"Controller manager pod not ready (webhook server may not be accepting connections)")


We could use --for=condition=Ready

Copilot

Pull Request Overview

This PR adds webhook readiness checks to e2e tests to ensure webhook services are fully operational before proceeding with metrics collection. This helps prevent flaky test failures caused by webhook services not being ready when subsequent test operations are executed.

Adds a new e2e-webhooks-readiness scaffold marker for injecting webhook readiness verification
Implements comprehensive webhook readiness checks including service existence, pod readiness, endpoint availability, and connectivity tests
Updates documentation to describe the new scaffold marker

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
pkg/plugins/golang/v4/scaffolds/internal/templates/test/e2e/test.go	Adds webhook readiness check template and logic to inject it at the new marker location
docs/book/src/reference/markers/scaffold.md	Documents the new `e2e-webhooks-readiness` scaffold marker
testdata/project-v4/test/e2e/e2e_test.go	Adds webhook readiness verification before metrics collection
testdata/project-v4-with-plugins/test/e2e/e2e_test.go	Adds webhook readiness verification before metrics collection
testdata/project-v4-multigroup/test/e2e/e2e_test.go	Adds webhook readiness verification before metrics collection
docs/book/src/multiversion-tutorial/testdata/project/test/e2e/e2e_test.go	Adds webhook readiness verification before metrics collection
docs/book/src/getting-started/testdata/project/test/e2e/e2e_test.go	Adds scaffold marker placeholder for webhook readiness checks
docs/book/src/cronjob-tutorial/testdata/project/test/e2e/e2e_test.go	Adds webhook readiness verification before metrics collection

Copilot · 2025-11-10T05:52:36Z

pkg/plugins/golang/v4/scaffolds/internal/templates/test/e2e/test.go

+					"-n", namespace, "-o", "jsonpath={.items[0].status.conditions[?(@.type=='Ready')].status}")
+				output, err := utils.Run(cmd)
+				g.Expect(err).NotTo(HaveOccurred())
+				g.Expect(output).To(Equal("True"), 


Trailing whitespace detected at the end of this line. This should be removed for consistency with code style standards.

Suggested change

g.Expect(output).To(Equal("True"),

g.Expect(output).To(Equal("True"),

Copilot · 2025-11-10T05:52:37Z

pkg/plugins/golang/v4/scaffolds/internal/templates/test/e2e/test.go

+					"Controller manager pod not ready (webhook server may not be accepting connections)")
+
+				// Check if webhook service endpoints are available
+				cmd = exec.Command("kubectl", "get", "endpoints", webhookServiceName, 


Trailing whitespace detected at the end of this line. This should be removed for consistency with code style standards.

Suggested change

cmd = exec.Command("kubectl", "get", "endpoints", webhookServiceName,

cmd = exec.Command("kubectl", "get", "endpoints", webhookServiceName,

Copilot · 2025-11-10T05:52:37Z

pkg/plugins/golang/v4/scaffolds/internal/templates/test/e2e/test.go

+				// Test webhook connectivity by checking if webhook server port is responding
+				cmd = exec.Command("kubectl", "run", "webhook-test", "--rm", "-i", "--restart=Never",
+					"--image=curlimages/curl:latest", "--",
+					"curl", "-k", "--connect-timeout", "5", 


Trailing whitespace detected at the end of this line. This should be removed for consistency with code style standards.

Suggested change

"curl", "-k", "--connect-timeout", "5",

"curl", "-k", "--connect-timeout", "5",

Copilot · 2025-11-10T05:52:37Z

pkg/plugins/golang/v4/scaffolds/internal/templates/test/e2e/test.go

+				cmd = exec.Command("kubectl", "run", "webhook-test", "--rm", "-i", "--restart=Never",
+					"--image=curlimages/curl:latest", "--",


The hardcoded pod name "webhook-test" could cause conflicts when Eventually retries this function. If a previous attempt fails after the pod is created but before it's deleted (e.g., due to curl failure), subsequent attempts will fail with "pod already exists" errors. Consider using a unique name with a timestamp suffix or using --generate-name flag instead.

Suggested change

cmd = exec.Command("kubectl", "run", "webhook-test", "--rm", "-i", "--restart=Never",

"--image=curlimages/curl:latest", "--",

cmd = exec.Command("kubectl", "run", "--rm", "-i", "--restart=Never",

"--generate-name=webhook-test-", "--image=curlimages/curl:latest", "--",

k8s-ci-robot added the cncf-cla: yes, needs-ok-to-test, and size/L labels Oct 17, 2025

k8s-ci-robot requested review from Kavinjsir and camilamacedo86 October 17, 2025 05:10

mayuka-c changed the title ~~🌱 Fixing flakes in e2e tests for metrics~~ 🌱 Fix flakes in e2e tests for metrics Oct 17, 2025

mayuka-c force-pushed the issues-5137 branch 2 times, most recently from f130aac to 1c32309 Compare October 21, 2025 01:59

camilamacedo86 requested changes Oct 21, 2025

mayuka-c changed the title ~~🌱 Fix flakes in e2e tests for metrics~~ 🐛 Fix flakes in e2e tests for metrics Oct 21, 2025

mayuka-c force-pushed the issues-5137 branch 2 times, most recently from b703a44 to d058ca9 Compare October 30, 2025 12:07

mayuka-c requested a review from camilamacedo86 October 30, 2025 12:18

camilamacedo86 reviewed Oct 30, 2025

docs/book/src/reference/markers/scaffold.md (outdated)

mayuka-c force-pushed the issues-5137 branch from d058ca9 to 25f370b Compare October 30, 2025 14:30

camilamacedo86 reviewed Oct 30, 2025

camilamacedo86 self-requested a review October 30, 2025 15:07

mayuka-c force-pushed the issues-5137 branch 5 times, most recently from 05fc492 to 014349e Compare November 2, 2025 15:02

[e2e] Fixing flakes in multigroup serve metrics

237b9af

Fix flakes - 2 Run make generate RUn make generate Minor fix Minor fix Fix-2 Fix-3 Implement a new marker for webhook readiness Lint fix scaffold md fix Address comment-2 Minor fix Test-1 Fix-2 Add error debug print Revert change

mayuka-c force-pushed the issues-5137 branch from 014349e to 237b9af Compare November 2, 2025 17:11

camilamacedo86 requested changes Nov 10, 2025

camilamacedo86 requested a review from Copilot November 10, 2025 05:42

Copilot AI reviewed Nov 10, 2025
